2  OBJECTIVE


The overall objective of this work is to obtain descriptions of scenes from single still images
or video sequences. From 1993 to 1995 the project focused on extracting descriptions of curved
3D objects from aerial intensity images. This work was motivated by the problem of reconstructing
non-rectilinear structures such as oil tanks, cooling towers and domes in airborne
reconnaissance imagery. Shadows of these structures provide constraints that make the problem
tractable.

In 1996 the project focused on generating descriptions of the dynamic behavior of objects in
ground-based video imagery, to address emerging needs for video surveillance in battlefield
awareness. TI event recognition and video indexing capabilities were applied to outdoor infrared
imagery, and extended to support construction of concise descriptions of events and automatic
recovery of environment structure from observations of human motion.


3  STATUS  OF  EFFORT 

From 1993 to 1995, TI developed theories for shadow analysis and inference of the 3D shape of
curved objects. The theories support recovery of object structure from the terminator and sweep
rule of the shadow ribbon. These theories were implemented in the experimental SHADOW system
and demonstrated on a collection of aerial images of curved 3D objects. At the 1996 IU Workshop
TI demonstrated Automatic Video Indexing software using live infrared data, illustrating new
capabilities for video surveillance. Subsequent progress in 1995-96 was delayed by a six-month
funding gap while TI worked with AFOSR and DARPA to align this research with emerging
battlefield awareness needs.

During 1996-1997, TI applied its existing Automatic Video Indexing algorithms to outdoor
infrared images obtained under a variety of illumination conditions. This work demonstrated the
applicability of the algorithms to video surveillance under these conditions, but also revealed the
need for improved modeling of environmental change and target behavior in order to increase the
robustness of the system. TI also applied its real-time event detection and tracking technology to
the problem of building concise descriptions of the motion and appearance of people in indoor
scenes. Finally, TI demonstrated the feasibility of recovering the structure of navigable space
from long-term observations of human motion in the environment.

TI’s work on video surveillance and event recognition is continuing under the joint sponsorship
of the Office of Research and Development (ORD) and the DARPA Image Understanding
Program.


4  ACCOMPLISHMENTS/NEW  FINDINGS 

Developed theoretical basis for recovering shapes of curved 3D objects in aerial images: In
this work objects are modeled as straight homogeneous generalized cylinders (SHGCs). SHGCs
are solids formed by sweeping a polygon along a straight orthogonal axis, allowing the polygon to
change scale according to a ‘sweep rule’ as it moves. This class of objects describes a wide variety
of building shapes, including cooling towers, oil tanks, domes, pyramids, and smokestacks, as
well as conventional rectilinear buildings with flat roofs. The theory developed in this work
exploits both the outlines of the object itself and the shadow that the object casts under oblique solar
illumination. The cross-section polygon of the SHGC model is derived from the shadow of the
terminating surface of the object, and the axis and sweep rule of the SHGC are derived from the
length, axis and two-dimensional sweep rule of the shadow.

Derived quasi-invariant constant in the projective geometry of circles: Under orthographic
projection a circular object such as the roof of an oil tank is imaged as an ellipse. The transition
points of the ellipse are points along the boundary of the ellipse at which the boundary curvature
switches from being less than that of the generating circle to being more than that of the
generating circle. In the course of developing the theory of shadow formation, it was discovered that the
angle between these transition points and the major axis approaches a limiting value of
$\arccos(\sqrt{3}/3) \approx 54.73^\circ$ as the tilt of the circle approaches zero. This constant is
quasi-invariant in that the transition point moves only slightly over a wide range of viewing angles.
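
The limiting value can be checked numerically. The sketch below is not taken from the SHADOW system. It assumes a unit circle tilted by an angle tau and imaged orthographically as an ellipse with semi-axes 1 and c = cos(tau), and it measures the transition point by its parametric angle t from the major axis. Setting the ellipse curvature c / (sin^2 t + c^2 cos^2 t)^(3/2) equal to the unit curvature of the generating circle gives cos^2 t = (1 - c^(2/3)) / (1 - c^2), which tends to 1/3 as the tilt goes to zero. The function name is hypothetical.

```python
import math

# Limit of the transition angle as the tilt goes to zero: acos(sqrt(3)/3).
LIMIT_DEG = math.degrees(math.acos(math.sqrt(3.0) / 3.0))


def transition_angle_deg(tilt_deg):
    """Parametric angle (degrees from the major axis) at which the curvature
    of the projected ellipse equals that of the generating unit circle.

    Assumes orthographic projection and 0 <= tilt_deg < 90.
    """
    if not 0.0 <= tilt_deg < 90.0:
        raise ValueError("tilt_deg must be in [0, 90)")
    c = math.cos(math.radians(tilt_deg))  # minor/major axis ratio
    if 1.0 - c < 1e-9:
        # The closed form is 0/0 at zero tilt, so return the analytic limit.
        return LIMIT_DEG
    cos_sq = (1.0 - c ** (2.0 / 3.0)) / (1.0 - c * c)
    return math.degrees(math.acos(math.sqrt(cos_sq)))


if __name__ == "__main__":
    for tilt in (0, 10, 20, 30, 45, 60):
        angle = transition_angle_deg(tilt)
        print(f"tilt {tilt:2d} deg -> transition angle {angle:.2f} deg")
```

Under these assumptions the angle decreases slowly as the tilt grows, which is consistent with the quasi-invariance described above.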

Completed SHADOW software for describing curved 3-D objects: Using theoretical results
from this research, TI developed an experimental software system that automatically processes an
image of an oil tank from an aerial photograph. For cylindrical objects, the software segments the
image into object, shadow, and background regions, finds the shadow length and height, and
produces a texture-mapped display of the inferred 3-D object.
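
As a rough illustration of how a shadow length constrains object height, the sketch below uses the standard flat-ground relation height = shadow length x tan(sun elevation). It is not the SHADOW implementation. The function name, the ground sample distance parameter, and the assumptions of a near-nadir view and level terrain are all illustrative.

```python
import math


def height_from_shadow(shadow_len_px, metres_per_px, sun_elevation_deg):
    """Estimate object height from the length of its cast shadow.

    Assumes level ground, a near-nadir view, and a known ground sample
    distance (metres_per_px) and sun elevation. All are hypothetical inputs.
    """
    if not 0.0 < sun_elevation_deg < 90.0:
        raise ValueError("sun_elevation_deg must be in (0, 90)")
    shadow_len_m = shadow_len_px * metres_per_px
    return shadow_len_m * math.tan(math.radians(sun_elevation_deg))


if __name__ == "__main__":
    # A 40-pixel shadow at 0.5 m/pixel with the sun 45 degrees above the horizon.
    print(height_from_shadow(40, 0.5, 45.0))
```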

Demonstrated Automatic Video Indexing of outdoor infrared surveillance video: TI collected
a set of six infrared video sequences of human activity under a variety of imaging conditions.
These sequences were analyzed using TI’s Automatic Video Indexing (AVI) software [Courtney
1997] and results and error rates were recorded. These experiments revealed that the limiting
factor in the system’s performance is its ability to distinguish humans, vehicles, and other objects of
interest from other sources of image change. This is particularly a problem at high ambient
temperatures where the thermal contrast between humans and other objects is very low. When
segmentation is successful, the tracking and event recognition algorithms have no difficulty
detecting events such as entry or exit from the scene or deposit and removal of objects. Figure 1 of
[Flinchbaugh 1997] (attachment A) shows an example in which the system successfully detected
an intruder emerging from a tree-lined gully.

Developed long-term monitoring system with algorithms for best view selection: Using a
previously developed real-time tracking and event recognition system, TI developed algorithms and
data representations for long-term monitoring of human activity. The algorithms and
representations are embodied in a system that takes one snapshot of each person who enters its
field of view, and stores it in a database along with information about time of entry, participation
in selected events, and path through the scene. The system selects a snapshot using criteria that
favor ‘good’ views of the person, i.e. views in which the person’s face is visible and they are close
to the camera. The database is accessed over a computer network using a conventional Web
browser, making it easy to review activity in the monitored region. The method and experimental results
are detailed in attachment B.
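
The selection of a single ‘good’ snapshot per person can be pictured as a pairwise comparison applied along a track. The sketch below is only an illustration of the two criteria named above (face visible, person close to the camera). The `View` fields, the use of projected height as a proxy for closeness, and the lexicographic ordering are assumptions, not the criteria detailed in attachment B.

```python
from dataclasses import dataclass


@dataclass
class View:
    frame: int           # frame index of the candidate snapshot
    face_visible: bool   # hypothetical output of a face/orientation test
    pixel_height: int    # projected height, used here as a closeness proxy


def better_view(a, b):
    """Return the preferred of two views: face visibility first, then size."""
    def key(v):
        return (v.face_visible, v.pixel_height)
    return a if key(a) >= key(b) else b


def best_snapshot(views):
    """Keep one snapshot for a tracked person by repeated pairwise comparison."""
    best = None
    for view in views:
        best = view if best is None else better_view(best, view)
    return best


if __name__ == "__main__":
    track = [View(1, False, 200), View(2, True, 80), View(3, True, 120)]
    print(best_snapshot(track))
```

Because only the current best view is retained, a comparison of this kind can run while the person is still being tracked and needs storage for one image per person.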

Demonstrated environment structure learning from observations of human activity: TI
applied its real-time tracking system to the problem of recovering the structure of the monitored
environment. The system observes human activity over a long period (24 hours in our
experiments) and records the correlation between the projected size of humans and their positions in the
image. Given that humans have a known height distribution and that they require a solid surface to
stand on, these observations make it possible to infer both the existence of solid surfaces and their
distance from the camera. This allows a surveillance camera to discover the structure of the
environment it is monitoring without calibration or an externally supplied map. The method and
experimental results are detailed in attachment B.
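
The idea of inferring surfaces and distances from the projected size of people can be sketched as follows. This is a simplified illustration, not the method of attachment B. It assumes each observation supplies the image position of a person's feet and their projected height in pixels, groups observations into coarse image cells, and uses a pinhole relation (distance = focal length x height / projected height) with an assumed typical person height. The focal length is a hypothetical input. Without it, the reciprocal of the projected height still gives a relative distance.

```python
from collections import defaultdict
from statistics import median

ASSUMED_PERSON_HEIGHT_M = 1.7  # assumed typical standing height


def learn_surface_map(observations, cell_px=40, min_samples=3):
    """Map image cells to the typical projected height of people seen there.

    observations: iterable of (foot_x, foot_y, pixel_height) tuples.
    A cell with enough samples is taken as evidence of a walkable surface.
    """
    samples = defaultdict(list)
    for foot_x, foot_y, pixel_height in observations:
        if pixel_height > 0:
            cell = (int(foot_x) // cell_px, int(foot_y) // cell_px)
            samples[cell].append(pixel_height)
    return {cell: median(h) for cell, h in samples.items()
            if len(h) >= min_samples}


def distance_m(typical_pixel_height, focal_length_px):
    """Pinhole estimate of camera-to-surface distance for one cell."""
    return focal_length_px * ASSUMED_PERSON_HEIGHT_M / typical_pixel_height


if __name__ == "__main__":
    obs = [(10, 10, 100), (12, 15, 110), (20, 30, 90), (300, 200, 50)]
    surface = learn_surface_map(obs)
    for cell, height in surface.items():
        print(cell, height, distance_m(height, focal_length_px=500.0))
```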

5  PERSONNEL  SUPPORTED 

The  primary  contributors  are: 

Dr.  Thomas  J.  Olson  (part-time,  7/96-present) 

Dr.  Kashi  Rao  (part-time,  10/93-12/95) 

Other  contributors  include: 

Dr.  Frank  Z.  Brill 

Mr.  Jonathan  D.  Courtney 

Dr.  Bruce  E.  Flinchbaugh 

6  PUBLICATIONS 

A)  SUBMITTED  BUT  NOT  YET  ACCEPTED:  None 

B)  ACCEPTED  BUT  NOT  YET  PUBLISHED:  None 

C)  PUBLISHED: 

Courtney,  J.  “Automatic  Video  Indexing  via  Object  Motion  Analysis”,  Pattern  Recognition 
v.  30  no.  4,  April  1997 

Flinchbaugh,  B.,  “Robust  Video  Motion  Detection  and  Event  Recognition”,  in  Proc.  1997 
DARPA  Image  Understanding  Workshop,  New  Orleans,  May  1997. 

Flinchbaugh, B., and T. Olson, “Autonomous Video Surveillance”, in Emerging Applications
of Computer Vision, D. Schaefer and E. Williams, eds., Proc. SPIE 2962, Washington DC,
October 1996.

Flinchbaugh,  B.  and  K.  Rao,  “Image  Understanding  Research  at  TI”,  in  Proc.  1994  DARPA 
Image  Understanding  Workshop,  Monterey,  November  1994. 

Olson, T., and F. Brill. “Moving Object Detection and Event Recognition Algorithms for
Smart Cameras”, in Proc. 1997 DARPA Image Understanding Workshop, New Orleans, May 1997.

Rao, K., and P. Sarwal. “A Computer Vision System to Detect 3-D Rectangular Solids”, in
Proc. Third IEEE Workshop on Applications of Computer Vision, Sarasota, Florida, pp. 27-32,
December 1996.

Rao,  K.  “Shape  Description  of  Curved  3-D  Objects  for  Aerial  Surveillance”,  in  Proc.  1996 
DARPA  Image  Understanding  Workshop,  Palm  Springs,  February  1996. 

Rao,  K.  “Curved  3-D  Object  Description  from  Single  Aerial  Images  Using  Shadows”,  in 
Proc.  1994  DARPA  Image  Understanding  Workshop,  Monterey,  November  1994. 




Rao, K., and B. Flinchbaugh, “Vision Research at TI: 1994-95 Progress”, in Proc. 1996
DARPA Image Understanding Workshop, Palm Springs, February 1996.

7  INTERACTIONS/TRANSITIONS 

A)  PARTICIPATION/PRESENTATIONS AT MEETINGS, CONFERENCES, SEMINARS, ETC.

Brill, F., “Autonomous Video Monitoring”, Demonstration at CVPR 97, San Juan, June 1997.

[...] Action Perception”, NSF/DARPA Workshop on Perception of Action, May 1997.

[...] Needs for Computer Vision and Pattern Recognition”, Panel Chair for IEEE Computer
Vision & Pattern Recognition Conference, June 1996.

[...] IEEE Computer Vision & Pattern Recognition Conference, June [...]

[...] Signal Processing Workshop, [...]

[...] Demonstration at the 1996 Image Understanding Workshop, Palm Springs, February 1996.

[...] Vision Laboratory, [...] University [...]

B)  CONSULTATIVE  AND  ADVISORY  FUNCTIONS 
Dr.  Thomas  Olson  served  on: 

NSF/DARPA  Workshop  on  Action  Perception,  Brewster,  MA,  May  1997. 

Proposal Evaluation Committee, Chair: Maria Zemankova (NSF), Arlington, VA, February 1996.

Dr.  Bruce  Flinchbaugh  serves  on: 

Industrial  Liaison  Committee,  IAPR,  1992-. 

Computer  Science  Advisory  Board,  The  Ohio  State  University,  1995-. 

External Advisory Board, Beckman Image Laboratory, University of Illinois, 1996-.

Program Chair, Third IEEE Workshop on Applications of Computer Vision, 1994.

Program Committee, CVPR 95.

Program Committee, Fourth IEEE Workshop on Applications of Computer Vision, 1996.

C)  TRANSITIONS 

Demonstration Led to ORD Autonomous Video Surveillance Program: At the 1996 DARPA [...]
concept for real-time security monitoring of office building environments, using enabling
technology developed in TI IR&D. This led to the creation of the real-time tracking and event recognition
system used in part of this research.


Long-Term Monitoring and Best View Selection to be Demonstrated at ORD: Dr. Yeongii
Kim has expressed interest in the possibility of applying the long-term monitoring and best view
selection capability developed under this contract to intelligence needs, and has requested that it
be demonstrated to potential customers at an ORD site as part of an ORD contract demonstration.
Robust  Video  Motion  Detection  and  Event  Recognition

Bruce  Flinchbaugh 

Texas  Instruments 
Research  &  Development 
P.O.  Box  655303,  MS  8374
Dallas,  Texas  75265 
flinchbaugh@ti.com 


Abstract 

This report summarizes recent progress in video event recognition technology for
automatically monitoring scenes, and outlines objectives of new research to improve
reliability and extend the functionality. TI has demonstrated an event recognition
capability that automatically processes video data at 10-20 frames per second and
reports the events as they occur during long periods of observation. For example, as
people, vehicles and objects move in the field of view, the system recognizes when
entities enter and exit the scene, when a person deposits an object, when overall
imaging conditions change, and when someone loiters in a specified area. The system
has been demonstrated using an infrared video camera in darkness and CCD cameras in
lighted areas. Ongoing research is enhancing the reliability of video motion analysis
methods for robust performance in outdoor environments, and extending event
recognition functionality for new kinds of events. This research will enable
networked smart cameras for autonomous situational awareness of site perimeters,
battlefields and other urban and rural areas where physical security and safety are
primary concerns.

1  Research  Objectives 

The overall objective of this research is to develop and demonstrate new video
processing methods for automatically monitoring scenes. Whereas cameras of today
deliver images and video data, smart cameras of the future will deliver information
derived from video data. These smart cameras will communicate via local and wide
area networks to enable many new capabilities. For defense needs, smart cameras will
autonomously deliver information about live events to distributed information
systems that support battlefield awareness in urban and rural environments. Smart
cameras will effectively extend the sight of commanders to remote areas by
accurately drawing attention to important events in progress.

Specific goals are to develop video surveillance and monitoring methods to recognize
new kinds of events, to improve the reliability of the moving object analysis
process, and to demonstrate effectiveness of the new methods in performing important
tasks. New event recognition methods will classify motions and interactions of
objects into custom categories that are important for mission-specific tasks. Robust
moving object detection and tracking is needed to interpret significant changes in
video sequences as entities move in the field of view, especially amidst video
changes caused by variations in illumination, temperature, wind, and occlusions.

2  Demonstration  and  Evaluation 

Proof-of-concept demonstrations will emphasize physical security monitoring tasks in
and around urban area buildings. The outdoor experiments will be of particular
importance for battlefield awareness. For example, the infrared image of Figure 1
shows a rural monitoring scenario in which a person has emerged from behind a tree
and is walking across a grassy area. Exemplary tasks in this scenario are to
reliably determine when a person is in the field of view, and to count the number of
people who cross the field. To achieve practical demonstration goals, a variety of
open-ended research issues must be resolved to some extent. What kinds of events can
be recognized using a single video camera? What contextual information is needed for
reliable video monitoring in a given situation? This research will contribute new
insight while developing new functionality for smart cameras of the future.

More information about this research is available at:
http://www.ti.com/research/docs/iuba/index.html


Realistic video monitoring tasks will be used to test new techniques for robust
moving object detection and event recognition, with two kinds of metrics for
evaluating progress. Physical security monitoring experts will be consulted to select
worthwhile new events to recognize, and to provide feedback about the quality of
system performance compared to current practice. This evaluation will identify
operational advantages of autonomous video event recognition systems. The primary
quantitative metrics for characterizing performance are the error rates of event
recognition reports. For example, if the task is to capture a single frontal view
image of each person who loiters in a specified area, then non-frontal images, extra
frontal images, and no frontal image of a loitering person would contribute to the
error rate.


Figure  1.  Autonomous  video  monitoring  of  remote 
areas  draws  attention  to  important  events  in 
progress. 


3  Autonomous  Video  Surveillance  Progress 

In previous TI research [Flinchbaugh and Olson, 1996], several video monitoring
techniques were devised to demonstrate feasibility of tracking people and marking
their positions on a map display [Flinchbaugh and Bannon, 1994], recognizing whether
a person is holding a box [Rao and Sarwal, 1996], and recognizing some basic actions
or events (enter, exit, deposit, remove, move, rest) of people and objects in the
field of view [Courtney, 1997].


During the past year, an Autonomous Video Surveillance (AVS) system [Olson and
Brill, 1997] has been developed that integrates the previous techniques for the
first time, and provides several new integrated capabilities to monitor TV and
infrared video cameras:

Calibration-Free Image-to-World Mapping: After an operator specifies approximate
correspondences between selected image regions and map regions, the system estimates
3D locations of objects in the field of view without solving for the camera
projection matrix or internal calibration parameters.

User Interface for Multiple Cameras: The map-based user interface has been extended
to operate as a server for multiple video processors, allowing the operator to
visually monitor tracks and event reports from multiple cameras, as positions of
people and objects are dynamically plotted on a map.

Object Analysis: The system classifies objects that have been deposited in a scene
as one of several known object types (e.g., box, briefcase, and notebook) or as an
unknown object.

Contextual Alarms: The alarm monitoring system allows alarms to be conditioned on
type of event, location, time of day, and the type of object involved in the event.

Best-View Selection: This method assesses the relative quality of two views of a
person in a video sequence. This allows a video monitoring system to select and save
a single high-quality digital snapshot of each person that enters the field of view.

Real-Time Operation Without Special Hardware: All of the above capabilities except
object analysis run at 10-20 frames per second on a conventional workstation. This
capability enables long-term experiments that were previously not feasible, and
improves tracking and event recognition reliability.
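
The Contextual Alarms capability listed above conditions an alarm on event type, location, time of day, and object type. One way to represent such a rule is sketched below. The class names, field names, and region labels are hypothetical and are not taken from the AVS system. A field left as `None` is treated as "any".

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass
class Event:
    kind: str                          # e.g. "DEPOSIT", "LOITER", "ENTER"
    region: str                        # named map region where it occurred
    when: time                         # time of day of the event
    object_type: Optional[str] = None  # e.g. "briefcase", "box"


@dataclass
class AlarmRule:
    kind: str
    region: Optional[str] = None       # None matches any region
    object_type: Optional[str] = None  # None matches any object type
    start: time = time(0, 0)
    end: time = time(23, 59, 59)

    def matches(self, event):
        """True if the event satisfies every condition of this rule."""
        if event.kind != self.kind:
            return False
        if self.region is not None and event.region != self.region:
            return False
        if self.object_type is not None and event.object_type != self.object_type:
            return False
        if self.start <= self.end:
            return self.start <= event.when <= self.end
        # The time window wraps past midnight (e.g. 22:00 to 06:00).
        return event.when >= self.start or event.when <= self.end


if __name__ == "__main__":
    rule = AlarmRule(kind="DEPOSIT", region="table", object_type="briefcase")
    print(rule.matches(Event("DEPOSIT", "table", time(14, 0), "briefcase")))
    print(rule.matches(Event("DEPOSIT", "floor", time(14, 0), "box")))
```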

The AVS system has been used to demonstrate feasibility of generating real-time
alarms for specified events in three security monitoring scenarios. These
demonstrations illustrate how physical security can be partially automated to
monitor hallway, office, and building perimeter areas. In each area, a camera
provides live video data of scenes in the field of view, while the AVS system
monitors the video to analyze events and signal alarms.

Hallway Monitoring: Consider the scenario illustrated in Figure 2. The AVS system
detects and tracks people as they walk in office building hallways. Alarms are
interactively defined for conditions such as when someone loiters in a specified
area or enters a particular office. Autonomous visual assessment provides
information to augment other information, such as biometric access control
information at building entrance points.


Figure  2.  In  a  hallway  monitoring  demonstration, 
the  AVS  system  tracks  people  and  signals  an  alarm 
when  someone  loiters  in  a  specified  area. 


Room Monitoring: For the room monitoring scenario shown in Figure 3, the AVS system
maintains a situational awareness record of events and signals alarms for a variety
of specified conditions. For example, an alarm may be specified for events in which
a person places a briefcase on a table, but not if the person leaves a box on the
floor. Using contextual information such as time of day and access control
identification, the system can report other alarm conditions that are functions of
who is in the room and when.


Perimeter Monitoring: For perimeter monitoring scenarios, an infrared camera is used
in a dark area to provide video data to AVS, illustrating the ability to monitor
areas outside buildings at night. For example, the AVS system could monitor a
building entrance and signal an alarm if someone walks by and leaves an object
outside the door (as illustrated in Figure 4), but not if someone loiters without
placing an object on the ground.


Figure  3.  Automatic  room  monitoring  provides 
concise  reports  of  activities  in  the  field  of  view 


Figure  4.  An  outdoor  site  perimeter  surveillance 
scenario  involves  an  infrared  video  camera  to 
recognize  events  in  darkness 


Acknowledgments 

Frank Brill and Tom Olson developed the new capabilities described in this report.

References 

[Courtney,  1997]  J.  D.  Courtney.  Automatic  video 
indexing  via  object  motion  analysis.  Pattern 
Recognition,  30(4),  April  1997. 

[Flinchbaugh and Bannon, 1994] B. Flinchbaugh and T. Bannon. Autonomous scene
monitoring system. In Proc. 10th Annual Joint Government-Industry Security Tech.
Symposium, American Defense Preparedness Association, June 1994.


[Flinchbaugh and Olson, 1996] B. Flinchbaugh and T. Olson. Autonomous video
surveillance. In Proceedings of the SPIE: Emerging Applications of Computer Vision,
D. Schaefer and E. Williams, editors, Washington, D.C., vol. 2962, pp. 144-153,
October 1996.

[Olson and Brill, 1997] T. J. Olson and F. Z. Brill. Moving object detection and
event recognition for smart cameras. In 1997 Proceedings of the DARPA Image
Understanding Workshop, Morgan Kaufmann, May 1997.

[Rao and Sarwal, 1996] K. Rao and P. Sarwal. A computer vision system to detect 3-D
rectangular objects. In Proceedings of the Third IEEE Workshop on Applications of
Computer Vision, pp. 27-32, 1996.


Attachment  B 


Proceedings  of  the  1997  Image  Understanding 
Workshop,  New  Orleans,  pp.  159-175,  May  1997. 


Moving  Object  Detection  and  Event  Recognition  Algorithms  for  Smart  Cameras 

Thomas  J.  Olson 
Frank  Z.  Brill 

Texas  Instruments 
Research  &  Development 
P.O.  Box  655303,  MS  8374,  Dallas,  TX  75265 
E-mail:  olson@csc.ti.com,  brill@ti.com 
http://www.ti.com/research/docs/iuba/index.html 


Abstract 

Smart video cameras analyze the video stream and translate it into a description of
the scene in terms of objects, object motions, and events. This paper describes a
set of algorithms for the core computations needed to build smart cameras. Together
these algorithms make up the Autonomous Video Surveillance (AVS) system, a
general-purpose framework for moving object detection and event recognition. Moving
objects are detected using change detection, and are tracked using first-order
prediction and nearest neighbor matching. Events are recognized by applying
predicates to the graph formed by linking corresponding objects in successive
frames. The AVS algorithms have been used to create several novel video surveillance
applications. These include a video surveillance shell that allows a human to
monitor the outputs of multiple cameras, a system that takes a single high-quality
snapshot of every person who enters its field of view, and a system that learns the
structure of the monitored environment by watching humans move around in the scene.

1  Introduction 

Video cameras today produce images, which must be examined by humans in order to be
useful. Future ‘smart’ video cameras will produce information, including
descriptions of the environment they are monitoring and the events taking place in
it. The information they produce may include images and video clips, but these will
be carefully selected to maximize their useful information content. The symbolic
information and images from smart cameras will be filtered by programs that extract
data relevant to particular tasks. This filtering process will enable a single human
to monitor hundreds or thousands of video streams.

The research described in this report was sponsored in part by
the DARPA Image Understanding Program.

In pursuit of our research objectives [Flinchbaugh, 1997], we are developing the
technology needed to make smart cameras a reality. Two fundamental capabilities are
needed. The first is the ability to describe scenes in terms of object motions and
interactions. The second is the ability to recognize important events that occur in
the scene, and to pick out those that are relevant to the current task. These
capabilities make it possible to develop a variety of novel and useful video
surveillance applications.

1.1  Video  Surveillance  and  Monitoring 
Scenarios 

Our work is motivated by several types of video surveillance and monitoring
scenarios.

Indoor Surveillance: Indoor surveillance provides information about areas such as
building lobbies, hallways, and offices. Monitoring tasks in lobbies and hallways
include detection of people depositing things (e.g., unattended luggage in an
airport lounge), removing things (e.g., theft), or loitering. Office monitoring
tasks typically require information about people’s identities: in an office, for
example, the office owner may do anything at any time, but other people should not
open desk drawers or operate the computer unless the owner is present. Cleaning
staff may come in at night to vacuum and empty trash cans, but should not handle
objects on the desk.

Outdoor Surveillance: Outdoor surveillance includes tasks such as monitoring a site
perimeter for intrusion or threats from vehicles (e.g., car bombs). In military
applications, video surveillance can function as a sentry or forward observer, e.g.
by notifying commanders when enemy soldiers emerge from a wooded area or cross a
road.

In order for smart cameras to be practical for real-world tasks, the algorithms they
use must be robust. Current commercial video surveillance systems have a high false
alarm rate [Ringler and Hoover, 1995], which renders them useless for most
applications. For this reason, our research stresses robustness and quantification
of detection and false alarm rates. Smart camera algorithms must also run
effectively on low-cost platforms, so that they can be implemented in small,
low-power packages and can be used in large numbers. Studying algorithms that can
run in near real time makes it practical to conduct extensive evaluation and testing
of systems, and may enable worthwhile near-term applications as well as contributing
to long-term research goals.

1.2  Approach 

The first step in processing a video stream for surveillance purposes is to identify
the important objects in the scene. In this paper it is assumed that the important
objects are those that move independently. Camera parameters are assumed to be
fixed. This allows the use of simple change detection to identify moving objects.
Where use of moving cameras is necessary, stabilization hardware and stabilized
moving object detection algorithms can be used (e.g. [Burt et al., 1989, Nelson,
1991]). The use of criteria other than motion (e.g., salience based on shape or
color, or more general object recognition) is compatible with our approach, but
these criteria are not used in our current applications.
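
With a fixed camera, simple change detection can be as basic as thresholding the difference between the current frame and a reference image of the empty scene. The sketch below illustrates that idea on small grayscale images stored as nested lists. It is a minimal illustration, not the AVS implementation: the threshold value and function names are assumptions, and a practical system would also need region grouping and reference image maintenance.

```python
def detect_change(reference, frame, threshold=25):
    """Return a binary mask marking pixels that differ from the reference.

    reference and frame are equal-sized grayscale images given as lists of
    rows. The threshold is an illustrative value, not a tuned parameter.
    """
    return [
        [1 if abs(pixel - ref) > threshold else 0
         for pixel, ref in zip(frame_row, ref_row)]
        for frame_row, ref_row in zip(frame, reference)
    ]


def bounding_box(mask):
    """Bounding box (x_min, y_min, x_max, y_max) of changed pixels, or None."""
    coords = [(x, y) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    reference = [[10] * 4 for _ in range(3)]
    frame = [row[:] for row in reference]
    frame[1][2] = 200   # a bright object appears
    frame[2][3] = 90
    mask = detect_change(reference, frame)
    print(mask, bounding_box(mask))
```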

Our event recognition algorithms are based on graph matching. Moving objects in the
image are tracked over time. Observations of an object in successive video frames
are linked to form a directed graph (the motion graph). Events are defined in terms
of predicates on the motion graph. For instance, the beginning of a chain of
successive observations of an object is defined to be an ENTER event. Event
detection is described in more detail below.
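
The motion graph and the ENTER predicate can be illustrated with a small sketch. It links object centroids in successive frames using first-order prediction and greedy nearest neighbor matching, then reports a node with no predecessor as ENTER. Treating a node with no successor as EXIT is an assumption made here by symmetry. The function names, the distance gate `max_dist`, and the greedy matching are illustrative simplifications rather than the AVS algorithms.

```python
import math


def build_motion_graph(frames, max_dist=30.0):
    """Link observations in successive frames into a directed graph.

    frames: list of lists of (x, y) centroids, one list per video frame.
    Returns (nodes, edges); a node is (frame_index, object_index).
    """
    edges = []
    velocity = {}  # node -> displacement along its incoming link
    for t in range(len(frames) - 1):
        unclaimed = set(range(len(frames[t + 1])))
        for i, (x, y) in enumerate(frames[t]):
            vx, vy = velocity.get((t, i), (0.0, 0.0))
            px, py = x + vx, y + vy  # first-order prediction
            best, best_d = None, max_dist
            for j in sorted(unclaimed):
                qx, qy = frames[t + 1][j]
                d = math.hypot(qx - px, qy - py)
                if d < best_d:
                    best, best_d = j, d
            if best is not None:
                unclaimed.discard(best)
                qx, qy = frames[t + 1][best]
                velocity[(t + 1, best)] = (qx - x, qy - y)
                edges.append(((t, i), (t + 1, best)))
    nodes = [(t, i) for t, objs in enumerate(frames) for i in range(len(objs))]
    return nodes, edges


def enter_exit_events(nodes, edges):
    """ENTER: node with no predecessor. EXIT (assumed): node with no successor.

    Nodes in the first and last frames are reported too; a real system would
    treat the ends of the sequence separately.
    """
    has_pred = {dst for _, dst in edges}
    has_succ = {src for src, _ in edges}
    events = []
    for node in nodes:
        if node not in has_pred:
            events.append(("ENTER", node))
        if node not in has_succ:
            events.append(("EXIT", node))
    return events


if __name__ == "__main__":
    frames = [[(0, 0)], [(5, 0)], [(10, 0), (100, 100)], [(15, 0), (100, 102)]]
    nodes, edges = build_motion_graph(frames)
    print(enter_exit_events(nodes, edges))
```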

Our approach to video surveillance stresses 2D, image-based algorithms and simple,
low-level object representations that can be extracted reliably from the video
sequence. This emphasis yields a high level of robustness and low computational
cost. Object recognition and other detailed analyses are used only after the system
has determined that the objects in question are interesting and merit further
investigation.

1.3  Research  Strategy 

The  primary  technical  goal  of  this  research  is  to  de¬ 
velop  general-purpose  algorithms  for  moving 
object  detection  and  event  recognition.  These  algo¬ 
rithms  comprise  the  Autonomous  Video 
Surveillance  (AVS)  system,  a  modular  framework 
for  building  video  surveillance  applications.  AVS 
is  designed  to  be  updated  to  incorporate  better  core 
algorithms  or  to  tune  the  processing  to  specific  do¬ 
mains  as  our  research  progresses. 

In  order  to  evaluate  the  AVS  core  algorithms  and 
event  recognition  and  tracking  framework,  we  use 
them  to  develop  applications  motivated  by  the  sur¬ 
veillance  scenarios  described  above.  The 
applications  are  small-scale  implementations  of  fu¬ 
ture  smart  camera  systems.  They  are  designed  for 
long-term  operation,  and  are  evaluated  by  allowing 
them  to  run  for  long  periods  (hours  or  days)  and 
analyzing  their  output. 

The  remainder  of  this  paper  is  organized  as  fol¬ 
lows.  The  next  section  discusses  related  work. 
Section  3  presents  the  core  moving  object  detection 
and  event  recognition  algorithms,  and  the  mecha¬ 
nism  used  to  establish  the  3D  positions  of  objects. 
Section  4  presents  applications  that  have  been  built 
using  the  AVS  framework.  The  final  section  dis¬ 
cusses  the  current  state  of  the  system  and  our 
future  plans. 


2  Related  Work 

Our  overall  approach  to  video  surveillance  has 
been  influenced  by  interest  in  selective  attention 
and task-oriented processing [Swain and Stricker,
1991, Rimey and Brown, 1993, Camus et al.,
1993]. The fundamental problem with current video
surveillance technology is that the useful
information  density  of  the  images  delivered  to  a 
human  is  very  low;  the  vast  majority  of  surveil¬ 
lance  video  frames  contain  no  useful  information 
at  all.  The  fundamental  role  of  the  smart  camera 
described  above  is  to  reduce  the  volume  of  data 
produced  by  the  camera,  and  increase  the  value  of 
that  data.  It  does  this  by  discarding  irrelevant 
frames,  and  by  expressing  the  information  in  the 
relevant  frames  primarily  in  symbolic  form. 

2.1  Moving  Object  Detection 

Most  algorithms  for  moving  object  detection  using 
fixed  cameras  work  by  comparing  incoming  video 
frames  to  a  reference  image,  and  attributing  signifi¬ 
cant  differences  either  to  motion  or  to  noise.  The 
algorithms  differ  in  the  form  of  the  comparison  op¬ 
erator  they  use,  and  in  the  way  in  which  the 
reference  image  is  maintained.  Simple  intensity 
differencing  followed  by  thresholding  is  widely 
used  [Jain  et  al.,  1979,  Yalamanchili  et  al.,  1982, 
Kelly  et  al.,  1995,  Bobick  and  Davis,  1996,  Court¬ 
ney,  1997]  because  it  is  computationally 
inexpensive  and  works  quite  well  in  many  indoor 
environments.  Some  algorithms  provide  a  means  of 
adapting  the  reference  image  over  time,  in  order  to 
track  slow  changes  in  lighting  conditions  and/or 
changes  in  the  environment  [Karmann  and  von 
Brandt,  1990,  Makarov,  1996a].  Some  also  filter 
the  image  to  reduce  or  remove  low  spatial  frequen¬ 
cy  content,  which  again  makes  the  detector  less 
sensitive to lighting changes [Makarov et al.,
1996b, Koller et al., 1994].

Recent  work  [Pentland,  1996,  Kahn  et  al.,  1996] 
has  extended  the  basic  change  detection  paradigm 
by  replacing  the  reference  image  with  a  statistical 
model  of  the  background.  The  comparison  operator 
becomes  a  statistical  test  that  estimates  the  proba¬ 
bility  that  the  observed  pixel  value  belongs  to  the 
background. 


Our  baseline  change  detection  algorithm  uses 
thresholded  absolute  differencing,  since  this  works 
well  for  our  indoor  surveillance  scenarios.  For  ap¬ 
plications  where  lighting  change  is  a  problem,  we 
use the adaptive reference frame algorithm of Karmann
and von Brandt [1990]. We are also
experimenting with a probabilistic change detector
similar to Pfinder [Pentland, 1996].

Our  work  assumes  fixed  cameras.  When  the  cam¬ 
era  is  not  fixed,  simple  change  detection  cannot  be 
used  because  of  background  motion.  One  approach 
to  this  problem  is  to  treat  the  scene  as  a  collection 
of  independently  moving  objects,  and  to  detect  and 
ignore  the  visual  motion  due  to  camera  motion 
[e.g., Burt et al., 1989]. Other researchers have proposed
ways of detecting features of the optical flow
that  are  inconsistent  with  a  hypothesis  of  self  mo¬ 
tion  [Nelson,  1991]. 

In  many  of  our  applications  moving  object  detec¬ 
tion  is  a  prelude  to  person  detection.  There  has 
been significant recent progress in the development
of  algorithms  to  locate  and  track  humans.  Pfinder 
(cited  above)  uses  a  coarse  statistical  model  of  hu¬ 
man  body  geometry  and  motion  to  estimate  the 
likelihood  that  a  given  pixel  is  part  of  a  human. 
Several  researchers  have  described  methods  of 
tracking human body and limb movements [Gavrila
and Davis, 1996, Kakadiaris and Metaxas, 1996]
and locating faces in images [Sung and Poggio,
1994, Rowley et al., 1996]. Intille and Bobick
[1995]  describe  methods  of  tracking  humans 
through  episodes  of  mutual  occlusion  in  a  highly 
structured  environment.  We  do  not  currently  make 
use  of  these  techniques  in  live  experiments  because 
of  their  computational  cost.  However,  we  expect 
that  this  type  of  analysis  will  eventually  be  an  im¬ 
portant  part  of  smart  camera  processing. 

2.2  Event  Recognition 

Most  work  on  event  recognition  has  focussed  on 
events  that  consist  of  a  well-defined  sequence  of 
primitive  motions.  This  class  of  events  can  be  con¬ 
verted  into  spatiotemporal  patterns  and  recognized 
using  statistical  pattern  matching  techniques.  A 
number  of  researchers  have  demonstrated  algo¬ 
rithms  for  recognizing  gestures  and  sign  language 
[e.g., Starner and Pentland, 1995]. Bobick and
Davis [1996] describe a method of recognizing stereotypical
motion patterns corresponding to
actions  such  as  sitting  down,  walking,  or  waving. 

Our  approach  to  event  recognition  is  based  on  the 
video  database  indexing  work  of  Courtney  [1997], 
which  introduced  the  use  of  predicates  on  the  mo¬ 
tion  graph  to  represent  events.  Motion  graphs  are 
well  suited  to  representing  abstract,  generic  events 
such  as  ‘depositing  an  object’  or  ‘coming  to  rest’, 
which  are  difficult  to  capture  using  the  pattern- 
based  approaches  referred  to  above.  On  the  other 
hand,  pattern-based  approaches  can  represent  com¬ 
plex  motions  such  as  ‘throwing  an  object’  or 
‘waving’,  which  would  be  difficult  to  express  using 
motion  graphs.  It  is  likely  that  both  pattern-based 
and  abstract  event  recognition  techniques  will  be 
needed  to  handle  the  full  range  of  events  that  are  of 
interest  in  surveillance  applications. 

3  AVS  Tracking  and  Event  Recognition 
Algorithms 

This  section  describes  the  core  technologies  that 
provide  the  video  surveillance  and  monitoring  ca¬ 
pabilities  of  the  AVS  system.  There  are  three  key 
technologies:  moving  object  detection,  visual 
tracking,  and  event  recognition.  The  moving  object 
detection  routines  determine  when  one  or  more  ob¬ 
jects  enter  a  monitored  scene,  decide  which  pixels 
in  a  given  video  frame  correspond  to  the  moving 
objects  versus  which  pixels  correspond  to  the  back¬ 
ground,  and  form  a  simple  representation  of  the 
object’s  image  in  the  video  frame.  This  representa¬ 
tion  is  referred  to  as  a  motion  region,  and  it  exists 
in  a  single  video  frame,  as  distinguished  from  the 
world  objects  which  exist  in  the  world  and  give  rise 
to  the  motion  regions. 


Figure 1: [panels: Difference Image, Thresholded Image] ... for moving object detection.

Visual  tracking  consists  of  determining  correspon¬ 
dences  between  the  motion  regions  over  a 
sequence  of  video  frames,  and  maintaining  a  single 
representation,  or  track,  for  the  world  object  which 
gave  rise  to  the  sequence  of  motion  regions  in  the 
sequence  of  frames.  Finally,  event  recognition  is  a 
means  of  analyzing  the  collection  of  tracks  in  order 
to  identify  events  of  interest  involving  the  world 
objects  represented  by  the  tracks. 

3.1  Moving  Object  Detection 

The  moving  object  detection  technology  we  em¬ 
ploy  is  a  2D  change  detection  technique  similar  to 
that described in Jain et al. [1979] and Yalamanchili
et al. [1982]. Prior to activation of the
monitoring  system,  an  image  of  the  background, 
i.e.,  an  image  of  the  scene  which  contains  no  mov¬ 
ing  or  otherwise  interesting  objects,  is  captured  to 
serve  as  the  reference  image.  When  the  system  is  in 
operation,  the  absolute  difference  of  the  current 
video  frame  from  the  reference  image  is  computed 
to  produce  a  difference  image.  The  difference  im¬ 
age  is  then  thresholded  at  an  appropriate  value  to 
obtain a binary image in which the “off” pixels represent
background pixels, and the “on” pixels
represent  “moving  object”  pixels.  The  four-con¬ 
nected  components  of  moving  object  pixels  in  the 
thresholded  image  are  the  motion  regions  (see  Fig¬ 
ure  1). 
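The detection steps above can be sketched as follows. This is a minimal illustration, not the AVS implementation: the function name `detect_motion_regions`, the use of nested lists for images, and the strict "greater than" threshold test are assumptions made for the example.

```python
from collections import deque


def detect_motion_regions(frame, reference, threshold):
    """Return motion regions as lists of (row, col) pixels.

    frame and reference are equal-sized 2D lists of grey levels
    (a hypothetical image representation chosen for this sketch).
    """
    rows, cols = len(frame), len(frame[0])
    # Absolute difference from the reference image, then thresholding:
    # True ("on") marks a moving-object pixel, False ("off") background.
    on = [[abs(frame[r][c] - reference[r][c]) > threshold for c in range(cols)]
          for r in range(rows)]

    visited = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not on[r][c] or visited[r][c]:
                continue
            # Flood fill over the four-connected neighbours of each pixel.
            queue = deque([(r, c)])
            visited[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and on[ny][nx] and not visited[ny][nx]):
                        visited[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions
```

Because connectivity is four-way, two "on" pixels that touch only diagonally fall into separate regions, which is one source of the breakup discussed next.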

Simple  application  of  the  object  detection  proce¬ 
dure  outlined  above  results  in  a  number  of  errors, 
largely  due  to  the  limitations  of  thresholding.  If  the 
threshold  used  is  too  low,  camera  noise  and  shad¬ 
ows  will  produce  spurious  objects;  whereas  if  the 
threshold  is  too  high,  some  portions  of  the  objects 
in the scene will fail to be separated from the background,
resulting in breakup, in which a single
world  object  gives  rise  to  several  motion  regions 
within  a  single  frame.  Our  general  approach  is  to 
allow  breakup,  but  use  grouping  heuristics  to 
merge  multiple  connected  components  into  a  single 
motion  region  and  maintain  a  one-to-one  corre¬ 
spondence  between  motion  regions  and  world 
objects  within  each  frame. 

One  grouping  technique  we  employ  is  2D  morpho¬ 
logical  dilation  of  the  motion  regions.  This  enables 
the  system  to  merge  connected  components  sepa¬ 
rated  by  a  few  pixels,  but  using  this  technique  to 
span  large  gaps  results  in  a  severe  performance 
degradation.  Moreover,  dilation  in  the  image  space 
may  result  in  incorrectly  merging  distant  objects 
which  are  nearby  in  the  image  (a  few  pixels),  but 
are  in  fact  separated  by  a  large  distance  in  the 
world  (a  few  feet). 

If  3D  information  is  available,  the  connected  com¬ 
ponent  grouping  algorithm  makes  use  of  an 
estimate  of  the  size  (in  world  coordinates)  of  the 
objects  in  the  image.  The  bounding  boxes  of  the 
connected  components  are  expanded  vertically  and 
horizontally  by  a  distance  measured  in  feet  (rather 
than  pixels),  and  connected  components  with  over¬ 
lapping  bounding  boxes  are  merged  into  a  single 
motion  region.  The  technique  for  estimating  the 
size  of  the  objects  in  the  image  is  described  in  sec¬ 
tion  3.4  below. 
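A minimal sketch of this grouping step is given below. It assumes that the bounding boxes have already been converted to world units (feet) and that a single margin is applied in both directions; the function name `group_components` and the union-find bookkeeping are illustrative choices, not details taken from the AVS system.

```python
def group_components(boxes, margin_ft):
    """Merge boxes (x_min, y_min, x_max, y_max), given in feet, whose
    expanded versions overlap. Returns one bounding box per group."""
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def expanded(b):
        # Grow the box vertically and horizontally by the margin.
        return (b[0] - margin_ft, b[1] - margin_ft,
                b[2] + margin_ft, b[3] + margin_ft)

    def overlap(a, b):
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlap(expanded(boxes[i]), expanded(boxes[j])):
                parent[find(i)] = find(j)

    # The merged motion region is the union of the original boxes.
    groups = {}
    for i, b in enumerate(boxes):
        root = find(i)
        g = groups.get(root)
        groups[root] = b if g is None else (
            min(g[0], b[0]), min(g[1], b[1]), max(g[2], b[2]), max(g[3], b[3]))
    return sorted(groups.values())
```

In this sketch, chains of pairwise overlaps are merged transitively, so three fragments in a row become one motion region even if the outer two do not overlap each other.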

3.2  Tracking 

The  function  of  the  AVS  tracking  routine  is  to  es¬ 
tablish  correspondences  between  the  motion 
regions  in  the  current  frame  and  those  in  the  previ¬ 
ous  frame.  We  use  the  technique  of  Courtney 
[1997],  which  proceeds  as  follows.  First  assume 
that  we  have  computed  2D  velocity  estimates  for 
the  motion  regions  in  the  previous  frame.  These  ve¬ 
locity  estimates,  together  with  the  locations  of  the 
centroids  in  the  previous  frame,  are  used  to  project 
the  locations  of  the  centroids  of  the  motion  regions 
into  the  current  frame.  Then,  a  mutual  nearest- 
neighbor  criterion  is  used  to  establish 
correspondences. 

Let $P$ be the set of motion region centroid locations
in the previous frame, with $p_i$ one such
location. Let $p'_i$ be the projected location of $p_i$ in
the current frame, and let $P'$ be the set of all such
projected locations in the current frame. Let $C$ be
the set of motion region centroid locations in the
current frame. If the distance between $p'_i$ and
$c_j \in C$ is the smallest for all elements of $C$, and
this distance is also the smallest of the distances
between $c_j$ and all elements of $P'$ (i.e., $p'_i$ and $c_j$
are mutual nearest neighbors), then establish a correspondence
between $p_i$ and $c_j$ by creating a
bidirectional strong link between them. Use the difference
in time and space between $p_i$ and $c_j$ to
determine a velocity estimate for $c_j$, expressed in
pixels per second. If there is an existing track containing
$p_i$, add $c_j$ to it. Otherwise, establish a new
track, and add both $p_i$ and $c_j$ to it.
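The mutual nearest-neighbor test can be sketched as follows. The function names, the tuple representation of centroids and velocities, and the frame interval `dt` (in seconds) are assumptions introduced for the example; ties between equidistant candidates are broken by list order, which the text does not specify.

```python
import math


def mutual_nearest_neighbors(prev_centroids, prev_velocities, curr_centroids, dt):
    """Return (i, j) pairs: previous region i strongly linked to current region j."""
    # Project each previous centroid into the current frame.
    projected = [(x + vx * dt, y + vy * dt)
                 for (x, y), (vx, vy) in zip(prev_centroids, prev_velocities)]
    if not projected or not curr_centroids:
        return []

    def nearest(point, candidates):
        # Index of the candidate closest to the given point.
        return min(range(len(candidates)),
                   key=lambda k: math.dist(point, candidates[k]))

    links = []
    for i, p in enumerate(projected):
        j = nearest(p, curr_centroids)
        # Keep the pair only if the relation holds in both directions.
        if nearest(curr_centroids[j], projected) == i:
            links.append((i, j))
    return links


def velocity(prev_point, curr_point, dt):
    """Velocity estimate in pixels per second from two linked centroids."""
    return ((curr_point[0] - prev_point[0]) / dt,
            (curr_point[1] - prev_point[1]) / dt)
```

Regions left out of the returned list are the ones that would receive weak links in the scheme described below.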

The strong links form the basis of the tracks, and their
correctness is held with high confidence. Video objects
which do not have mutual nearest neighbors in the
adjacent frame may fail to form correspondences
because the underlying world object is involved in
an event (e.g., enter, exit, deposit, remove). In order
to assist in the identification of these events,
objects without strong links are given unidirectional
weak links to their (non-mutual) nearest
neighbors. The weak links represent potential ambiguity
in the tracking process. The motion regions
in all of the frames, together with their strong and
weak links, form a motion graph.

Figure 2 depicts a sample motion graph. In the figure,
each frame is one-dimensional, and is
represented by a vertical line (F0 - F18). Circles
represent objects in the scene, the dark arrows represent
strong links, and the gray arrows represent
weak links. An object enters the scene in frame F1,
and then moves through the scene until frame F4,
where it deposits a second object. The first object
continues to move through the scene, and exits at
frame F6. The deposited object remains stationary.
At frame F8 another object enters the scene, temporarily
occludes the stationary object at frame
F10 (or is occluded by it), and then proceeds to
move past the stationary object. This second moving
object reverses directions around frames F13
and F14, returns to remove the stationary object in
frame F16, and finally exits in frame F17. An additional
object enters in frame F5 and exits in frame
F8 without interacting with any other object.

Figure 2: Event detection in the motion graph.

As indicated by the striped fill patterns in Figure 2,
the correct correspondences for the tracks are ambiguous
after object interactions such as the
occlusion in frame F10. The AVS system resolves
this  ambiguity  where  possible  by  preferring  to 
match  moving  objects  with  moving  objects,  and 
stationary  objects  with  stationary  objects.  The  dis¬ 
tinction  between  moving  and  stationary  tracks  is 
computed  using  thresholds  on  the  velocity  esti¬ 
mates,  and  hysteresis  for  stabilizing  transitions 
between  moving  and  stationary. 
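The moving/stationary distinction with hysteresis can be illustrated with a two-threshold rule. This is a sketch under assumptions: the text states only that thresholds and hysteresis are used, so the function name `classify_motion` and the particular pair of thresholds (`low`, `high`) are hypothetical.

```python
def classify_motion(speeds, low, high, initially_moving=False):
    """Label each speed sample True (moving) or False (stationary).

    A stationary track becomes moving only when its speed exceeds `high`;
    a moving track becomes stationary only when its speed drops below `low`.
    Speeds between the two thresholds leave the state unchanged.
    """
    moving = initially_moving
    labels = []
    for s in speeds:
        if moving and s < low:
            moving = False
        elif not moving and s > high:
            moving = True
        labels.append(moving)
    return labels
```

With `low=2` and `high=5`, a track whose speed hovers around 3 or 4 pixels per second keeps whatever label it already had, rather than flickering between the two states.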

Following  an  occlusion  (which  may  last  for  several 
frames)  the  frames  immediately  before  and  after 
the  occlusion  are  compared  (e.g.,  frames  F9  and 
F11 in Figure 2). The AVS system examines each
stationary  object  in  the  pre-occlusion  frame,  and 
searches  for  its  correspondent  in  the  post-occlusion 
frame  (which  should  be  exactly  where  it  was  be¬ 
fore,  since  the  object  is  stationary).  This  procedure 
resolves  a  large  portion  of  the  tracking  ambiguities. 
General  resolution  of  ambiguities  resulting  from 
multiple  moving  objects  in  the  scene  is  a  topic  for 
further  research.  The  AVS  system  may  benefit 
from  inclusion  of  a  “closed  world  tracking”  facility 
such  as  that  described  by  Intille  and  Bobick 
[1995a,  1995b]. 

3.3  Event  Recognition 

Certain  features  of  tracks  and  pairs  of  tracks  corre¬ 
spond  to  events.  For  example,  the  beginning  of  a 
track  corresponds  to  an  ENTER  event,  and  the  end 
corresponds  to  an  EXIT  event.  In  an  on-line  event 
detection  system,  it  is  preferable  to  detect  the  event 


as  near  in  time  as  possible  to  the  actual  occurrence 
of  the  event.  The  previous  system  which  used  mo¬ 
tion  graphs  for  event  detection  [Courtney,  1997] 
operated  in  a  batch  mode,  and  required  multiple 
passes  over  the  motion  graph,  precluding  on-line 
operation.  The  AVS  system  detects  events  in  a  sin¬ 
gle  pass  over  the  motion  graph,  as  the  graph  is 
created.  However,  in  order  to  reduce  errors  due  to 
noise,  the  AVS  system  introduces  a  slight  delay  of 
n  frame  times  (n=3  in  the  current  implementation) 
before reporting certain events. For example, in
Figure 2, an enter event occurs on frame F1. The
AVS system requires the track to be maintained for
n frames before reporting the enter event. If the
track is not maintained for the required number of
frames, it is ignored, and the enter event is not reported;
e.g., if n > 4, the object in Figure 2 which
enters in frame F5 and exits in frame F8 will not
generate any events.
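A single-pass detector with this kind of reporting delay might look like the sketch below. It is illustrative only: the class name `EnterExitDetector`, the per-frame input (a collection of active track IDs), and the reading of "maintained for n frames" as "observed in n consecutive frames" are assumptions.

```python
class EnterExitDetector:
    """Report ENTER/EXIT events online, one frame at a time."""

    def __init__(self, n=3):
        self.n = n
        self.age = {}          # track id -> frames observed so far
        self.first = {}        # track id -> frame of first observation
        self.confirmed = set() # tracks whose ENTER has been reported

    def update(self, frame_no, active_ids):
        events = []
        active = set(active_ids)
        for tid in sorted(active):
            self.age[tid] = self.age.get(tid, 0) + 1
            self.first.setdefault(tid, frame_no)
            # Report ENTER only once the track has survived n frames.
            if tid not in self.confirmed and self.age[tid] >= self.n:
                self.confirmed.add(tid)
                events.append(("ENTER", tid, self.first[tid]))
        for tid in sorted(set(self.age) - active):
            # Short-lived tracks vanish silently; confirmed ones EXIT.
            if tid in self.confirmed:
                self.confirmed.discard(tid)
                events.append(("EXIT", tid, frame_no))
            del self.age[tid]
            del self.first[tid]
        return events
```

The ENTER event carries the frame at which the track began, so the delay affects when the event is reported but not the time attributed to it.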

A  track  that  splits  into  two  tracks,  one  of  which  is 
moving,  and  the  other  of  which  is  stationary,  corre¬ 
sponds  to  a  DEPOSIT  event.  If  a  moving  track 
intersects  a  stationary  track,  and  then  continues  to 
move,  but  the  stationary  track  ends  at  the  intersec¬ 
tion,  this  corresponds  to  a  REMOVE  event.  The 
remove  event  can  be  generated  as  soon  as  the  re¬ 
mover  disoccludes  the  location  of  the  stationary 
object  which  was  removed,  and  the  system  can  de¬ 
termine  that  the  stationary  object  is  no  longer  at 
that  location. 


In  a  manner  similar  to  the  occlusion  situation  de¬ 
scribed  above  in  section  3.2,  the  deposit  event  also 
gives  rise  to  ambiguity  as  to  which  object  is  the  de¬ 
positor,  and  which  is  the  depositee.  For  example,  it 
may  have  been  that  the  object  which  entered  at 
frame F1 of Figure 2 stopped at frame F4 and deposited
a moving object, and it is the deposited
object  which  then  proceeded  to  exit  the  scene  at 
F6.  Again,  the  AVS  system  relies  on  a  moving  vs. 
stationary  distinction  to  resolve  the  ambiguity,  and 
insists  that  the  depositee  remain  stationary  after  a 
deposit  event.  The  AVS  system  requires  both  the 
depositor  and  the  depositee  tracks  to  extend  for  n 
frames  past  the  point  at  which  the  tracks  separate 
(e.g.,  past  frame  F5  in  Figure  2),  and  that  the  de¬ 
posited  object  remain  stationary;  otherwise  no 
deposit  event  is  generated. 

Also  detected  (but  not  illustrated  in  Figure  2),  are 
REST  events  (when  a  moving  object  comes  to  a 
stop),  and  MOVE  events  (when  a  RESTing  object 
begins  to  move  again).  Finally,  one  further  event 
that  is  detected  is  the  LIGHTSOUT  event,  which 
occurs  whenever  a  large  change  occurs  over  the  en¬ 
tire  image.  The  motion  graph  need  not  be  consulted 
to  detect  this  event. 

3.4  Image  to  World  Mapping 

In  order  to  locate  objects  seen  in  the  image  with  re¬ 
spect  to  a  map,  it  is  necessary  to  establish  a 
mapping  between  image  and  map  coordinates.  This 
mapping  is  established  in  the  AVS  system  by  hav¬ 
ing  a  user  draw  quadrilaterals  on  the  horizontal 


surfaces  visible  in  an  image,  and  the  corresponding 
quadrilaterals  on  a  map,  as  shown  in  Figure  3.  A 
warp  transformation  from  image  to  map  coordi¬ 
nates  is  constructed  using  the  quadrilateral 
coordinates. 

Once  the  transformations  are  established,  the  sys¬ 
tem  can  estimate  the  location  of  an  object  (as  in 
Flinchbaugh  and  Bannon  [1994])  by  assuming  that 
all  objects  rest  on  a  horizontal  surface.  When  an 
object  is  detected  in  the  scene,  the  midpoint  of  the 
lowest  side  of  the  bounding  box  is  used  as  the  im¬ 
age  point  to  project  into  the  map  window  using  the 
quadrilateral  warp  transformation  [Wolberg,  1990]. 

4  Applications 

The  AVS  core  algorithms  described  in  section  3 
have  been  used  as  the  basis  for  several  video  sur¬ 
veillance  applications.  Section  4  describes  three 
applications  that  we  have  implemented:  situational 
awareness,  best-view  selection  for  activity  logging, 
and  environment  learning. 

4.1  Situational  Awareness 

The goal of the situational awareness application is
to produce a real-time map-based display of the locations
of people, objects and events in a
monitored region, and to allow a user to specify
alarm conditions interactively. Alarm conditions
may be based on the locations of people and objects
in the scene, the types of objects in the scene,
the events in which the people and objects are involved,
and the times at which the events occur.
Furthermore, the user can specify the action to take
when an alarm is triggered, e.g., to generate an audio
alarm or write a log file. For example, the user
should be able to specify that an audio alarm
should be triggered if a person deposits a briefcase
on a given table between 5:00 pm and 7:00 am on a
weeknight.

[Monitor dialog; [x] marks a selected check box]
Name:          Deposit Briefcase Table A
Events:        [ ] enter  [ ] exit  [ ] loiter  [x] deposit  [ ] remove  [ ] move  [ ] rest  [ ] lightsout  [ ] lightson
Objects:       [ ] person  [ ] box  [x] briefcase  [ ] notebook  [ ] monitor object  [ ] unknown
Days of week:  [x] Monday  [x] Tuesday  [x] Wednesday  [x] Thursday  [x] Friday  [ ] Saturday  [ ] Sunday
Time of day:   from 5:00 pm until 7:00 am
Regions:       [ ] Table_B  [ ] Table_C  [x] Table_A
Duration:
Actions:       [ ] beep  [ ] popup  [ ] log  [ ] plot  [x] voice
Buttons:       Cancel  OK

Figure 5: User interface for specifying a monitor in AVS


The  architecture  of  the  AVS  situational  awareness 
system  is  depicted  in  Figure  4.  The  system  consists 
of  one  or  more  smart  cameras  communicating  with 
a  Video  Surveillance  Shell  (VSS).  Each  camera  has 
associated  with  it  an  independent  AVS  core  engine 
that  performs  the  processing  described  in  section  3. 
That  is,  the  engine  finds  and  tracks  moving  objects 
in  the  scene,  maps  their  image  locations  to  world 
coordinates,  and  recognizes  events  involving  the 
objects.  Each  core  engine  emits  a  stream  of  loca¬ 
tion  and  event  reports  to  the  VSS,  which  filters  the 
incoming  event  streams  for  user-specified  alarm 
conditions  and  takes  the  appropriate  actions. 


Figure 4: The situational awareness system


In  order  to  determine  the  identities  of  objects  (e.g., 
briefcase,  notebook),  the  situational  awareness  sys¬ 
tem  communicates  with  one  or  more  object 
analysis  modules  (OAMs).  The  core  engines  cap¬ 
ture  snapshots  of  interesting  objects  in  the  scenes, 
and  forward  the  snapshots  to  the  OAM,  along  with 
the  IDs  of  the  tracks  containing  the  objects.  The 
OAM  then  processes  the  snapshot  in  order  to  deter¬ 
mine  the  type  of  object.  The  OAM  processing  and 
the  AVS  core  engine  computations  are  asynchro¬ 
nous,  so  the  core  engine  may  have  processed 
several more frames by the time the OAM completes
its  analysis.  Once  the  analysis  is  complete,  the 
OAM  sends  the  results  (an  object  type  label)  and 
the  track  ID  back  to  the  core  engine.  The  core  en¬ 
gine  uses  the  track  ID  to  associate  the  label  with 
the  correct  object  in  the  current  frame  (assuming 
the  object  has  remained  in  the  scene  and  been  suc¬ 
cessfully  tracked). 


The  VSS  provides  a  map  display  of  the  monitored 
area,  with  the  locations  of  the  objects  in  the  scene 
reported  as  icons  on  the  map.  The  VSS  also  allows 
the  user  to  specify  alarm  regions  and  conditions. 
Alarm  regions  are  specified  by  drawing  them  on 
the  map  using  a  mouse,  and  naming  them  as  de¬ 
sired.  The  user  can  then  specify  the  conditions  and 
actions  for  alarms  by  creating  one  or  more  moni¬ 
tors.  Figure  5  depicts  the  monitor  creation  dialog 
box.  The  user  names  the  monitor  and  uses  the 
mouse  to  select  check  boxes  associated  with  the 
conditions  that  will  trigger  the  monitor.  The  user 
selects  the  type  of  event,  the  type  of  object  in¬ 
volved  in  the  event,  the  day  of  week  and  time  of 
day  of  the  event,  where  the  event  occurs,  and  what 
to do when the alarm condition occurs. The monitor
specified in Figure 5 specifies that a voice alarm
will be sounded when a briefcase is deposited on
Table_A between 5:00 pm and 7:00 am on a weeknight.
The voice alarms are customized to the event
and object type, so that when this alarm is triggered,
the system will announce “deposit box” via
its audio output. Figure 6 shows a person about to
trigger this alarm.

Figure 6: Tracking an object in the scene on the map
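The filtering performed by a monitor can be sketched as a simple record with a matching predicate. The `Monitor` class, its field names, and in particular the treatment of a time window that wraps past midnight (the early-morning portion is attributed to the day on which the window started) are assumptions made for this example; the text does not describe how the VSS represents monitors internally.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Monitor:
    name: str
    events: set
    objects: set
    days: set      # weekdays on which the window starts, Monday = 0
    start: time
    end: time
    regions: set
    action: str

    def in_time_window(self, when):
        t = when.time()
        if self.start <= self.end:
            return when.weekday() in self.days and self.start <= t < self.end
        # Window wraps past midnight (e.g. 5:00 pm until 7:00 am).
        if t >= self.start:
            return when.weekday() in self.days
        if t < self.end:
            # Early morning counts toward the previous day's window.
            return (when.weekday() - 1) % 7 in self.days
        return False

    def matches(self, event, obj, region, when):
        return (event in self.events and obj in self.objects
                and region in self.regions and self.in_time_window(when))


# The monitor of Figure 5, expressed in this hypothetical representation.
briefcase_monitor = Monitor(
    name="Deposit Briefcase Table A",
    events={"deposit"}, objects={"briefcase"},
    days={0, 1, 2, 3, 4}, start=time(17, 0), end=time(7, 0),
    regions={"Table_A"}, action="voice")
```

Each incoming event report would be tested against every monitor with `matches`, and the monitor's action taken on a match.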

5  Best-View  Selection  for  Activity  Logging

In  many  video  surveillance  applications  the  goal  of 
surveillance  is  not  to  detect  events  in  real  time  and 
generate  alarms,  but  rather  to  construct  a  log  or  au¬ 
dit  trail  of  all  of  the  activity  that  takes  place  in  the 
camera’s  field  of  view.  This  log  is  examined  by  in¬ 
vestigators  after  a  security  incident  (e.g.,  a  theft  or 
terrorist  attack),  and  is  used  to  identify  possible 
suspects  or  witnesses. 

In  order  to  gain  experience  with  this  type  of  appli¬ 
cation,  we  have  used  the  tracking  and  event 
detection  capabilities  described  in  section  3  to  con¬ 
struct  a  program  that  monitors  and  records  the 
movements  of  humans  in  its  field  of  view.  For  ev¬ 
ery  person  that  it  sees,  it  creates  a  log  file  that 
summarizes  important  information  about  the  per¬ 
son,  including  a  snapshot  taken  when  the  person 
was  close  to  the  camera  and  (if  possible)  facing  it. 
The  log  files  are  made  available  to  authorized  users 
via  the  World-Wide  Web. 


5.1  Architecture 

The  application  makes  use  of  the  AVS  core  algo¬ 
rithms  to  detect  and  track  people.  Upon  detection 
of  a  track  corresponding  to  a  person  in  the  input, 
the  tracker  associates  a  data  record  with  the  track. 
The  data  record  contains  a  summary  of  information 
about  the  person,  including  a  snapshot  extracted 
from  the  current  video  image.  As  the  person  is 
tracked  through  the  scene,  the  tracker  examines 
each  image  of  that  person  that  it  receives.  If  the 
new  image  is  a  better  view  of  the  person  than  the 
previously  saved  snapshot,  the  snapshot  is  replaced 
with  the  new  view.  When  the  person  leaves  the 
scene,  the  data  record  is  saved  to  a  file. 

Each  log  entry  file  records  the  time  when  the  per¬ 
son  entered  the  scene  and  a  list  of  coordinate  pairs 
showing  their  position  in  each  video  frame.  Each 
log  entry  file  also  contains  the  snapshot  that  was 
stored  in  the  track  record  for  the  person  when  they 
exited  the  scene.  Because  of  the  way  snapshots  are 
maintained,  the  final  snapshot  is  the  best  view  of 
the  person  that  the  system  had  during  tracking.  Fi¬ 
nally,  the  log  entry  file  contains  a  pointer  to  the 
reference  image  that  was  in  effect  when  the  snap¬ 
shot  was  taken.  This  information  forms  an 
extremely  concise  description  of  the  person’s 
movements  and  appearance  while  they  were  in  the 
scene. 

Figure 7: Floor plan of area used for hallway monitoring experiments. Camera is located at right and
monitors the hallway and printer alcove.

Selecting the best view: The system uses simple
heuristics to decide when the current view of a person
is better than the previously saved view. First,
the  new  view  is  considered  better  if  the  subject  is 
moving  toward  the  camera  in  the  current  frame, 
and  was  moving  away  in  the  previously  saved 
view.  This  causes  the  system  to  favor  views  in 
which  the  subject’s  face  is  visible.  If  this  rule  does 
not  apply,  the  new  view  is  considered  better  if  the 
subject  appears  to  be  larger  (subtends  a  larger  visu¬ 
al  angle).  This  causes  the  system  to  prefer  views  in 
which  the  subject  is  close  to  the  camera.  Other  pos¬ 
sible  view  selection  heuristics  are  discussed  in 
Kelly  et  al.  [1995]. 
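The two rules can be written down directly, as in the following sketch. The `View` record and its fields are hypothetical, and apparent size is represented by a single number (for instance, bounding-box area in pixels) as a stand-in for visual angle. The rules are applied literally as stated above, so a larger view can replace a saved view even when the subject is moving away in the new one.

```python
from dataclasses import dataclass


@dataclass
class View:
    approaching: bool  # subject moving toward the camera in this frame
    size: float        # apparent size of the subject (assumed measure)


def is_better(new, saved):
    # Rule 1: the subject is approaching now but was moving away in the
    # saved view, so the face is more likely to be visible.
    if new.approaching and not saved.approaching:
        return True
    # Rule 2: otherwise prefer the view in which the subject appears larger.
    return new.size > saved.size


def best_view(views):
    """Fold the rule over a track's views in order of arrival."""
    best = views[0]
    for v in views[1:]:
        if is_better(v, best):
            best = v
    return best
```

Because the comparison is made against the single saved snapshot, the result depends on the order in which views arrive, which matches the online replacement scheme described in section 5.1.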


Handling background change: The test environment
experiences significant lighting variation
during the day due to window lighting, opening
and closing doors etcetera. In addition, during the
day people frequently deposit, remove, or reposition
objects in the scene. This creates permanent
regions of difference between the scene and the
reference image. Without some mechanism for updating
the reference image, the system would
continue to track these difference regions as objects.
Therefore, the tracker was instructed to
discard the current tracks and grab a new reference
image whenever it determined that all objects in
the scene were stationary, and that no object had
moved for several seconds.

User Interface: Log files are saved in a directory tree associated
with the camera that produced the data. Along with
the log files, the monitoring application creates
HTML documents that allow a web browser to
navigate the directory tree and access the log entries.
Log entries are displayed by a Java applet that
displays the best snapshot of the person in the context
of the reference image, and overlays the
person’s path through the scene on the image. The
applet runs as an independent thread that checks
periodically to see if any new log entries have been
created. Thus if the user is browsing the entries for
the current day, new entries become available to
the browser as soon as they occur.

5.2  Experiments

The system described above was tested in a hallway of our laboratory. Figure 7
shows the hallway floor plan. The camera is mounted in the hallway ceiling and
looks west toward a window-lit corridor that runs around the perimeter of the
building. The hallway experiences heavy traffic, because it contains a laser
printer, a copier, and the office water cooler. The hallway passes under the
camera and continues to the east out of the field of view.

The system was allowed to run for a total of 118 hours over a period of a week.
Most laboratory personnel were unaware that a test was in progress, so the
system was exposed to normal daily activity. During the test the system
recorded a total of 965 log entries. Figure 8 shows the browser display for a
typical log entry. In this sequence the subject entered the scene from the
cross corridor at rear and came down the hallway on his way to the copier, out
of view at lower right. His path is shown as a line on the floor, which appears
red when viewed with a color browser.


Figure 8: Log entry browser interface. The line drawn on the floor in the upper
image shows the subject’s path from entry to exit. The list entry selected at
left is the time at which the image was taken.


Figure 9 demonstrates the effect of the system’s preference for frontal views.
In this sequence the subject entered at the bottom of the scene and walked away
from the camera. He turned around and took a few steps back toward the camera,
then turned away again and continued down the hallway, eventually exiting via
the first door on the left. Although the subject’s back was toward the camera
most of the time, the view preference heuristics selected a view taken while he
was facing the camera.

Performance  Evaluation 

In order to assess the performance of the monitoring application, all of the
log entries for the experiment period were examined and scored by one of the
authors. Entries were classified as follows:

Face/Non-face: Entries containing a view of a subject’s head were classified as
FACES if the subject’s face (specifically, the subject’s nose) was visible;
otherwise they were classified as NONFACES.

False Alarm: Images which contained no human and appeared to be caused by noise
were classified as FALSE ALARMS.

Bad Path: Entries in which the floor trace is clearly corrupt in some way were
classified as BAD PATHs.

Bad Choice: In some cases it is clear from the floor trace that the system made
a poor choice of which image of a person to save in the log entry. These
entries were classified as BAD CHOICE.

False Negative: In some cases it is clear that the system failed to take a
usable picture of a person who was in the scene. These were classified as FALSE
NEGATIVES. About half of the false negatives occurred when the system selected
a view in which the subject’s head was not visible, typically because the
subject was in the act of passing through a doorway. The others occurred when
the system became confused by occlusion, and incorrectly grouped two people
into a single log entry. Note that we do not have ground truth for the
observation period, so there may have been other detection failures that went
unnoticed. However, monitoring by the authors during the daytime revealed no
failures of this type. We believe that the FALSE NEGATIVE count is a good
estimate of the number of detection failures.

Figure 9: Log entry showing the effect of the view selection heuristic
preference for frontal views. The subject was walking away from the camera for
most of this sequence, but the system was able to capture a view while he was
facing the camera.

Table 1: Long-term monitoring system performance

  Log entry type     Number of entries
  FACE               493
  NONFACE            380
  FALSE ALARM         62
  FALSE NEGATIVE      44
  BAD PATH           112
  BAD CHOICE          29
  TOTAL ENTRIES      965

Table 1 shows the classification counts for the test period. Assuming that the
false negative count is valid, the system achieved a detection rate of 95.2%
with a false alarm rate of 6.4%. The recorded path of the subject was correct
(or at least plausible) in 88.4% of entries, and the system made conspicuously
bad choices of what image to save in only 3% of entries.
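
The percentages quoted above can be reproduced from the counts in Table 1. The
formulas in the sketch below are inferred from the reported figures rather than
stated in the text: the detection rate is taken to be valid human detections
(FACE plus NONFACE) over detections plus false negatives, and the other rates
are taken relative to the total number of log entries.

```python
# Counts copied from Table 1.
counts = {
    "FACE": 493,
    "NONFACE": 380,
    "FALSE ALARM": 62,
    "FALSE NEGATIVE": 44,
    "BAD PATH": 112,
    "BAD CHOICE": 29,
}
total_entries = 965

# Entries that actually show a person.
detected = counts["FACE"] + counts["NONFACE"]

# Assumed definitions (inferred, not given in the text).
detection_rate = detected / (detected + counts["FALSE NEGATIVE"])
false_alarm_rate = counts["FALSE ALARM"] / total_entries
good_path_rate = 1 - counts["BAD PATH"] / total_entries
bad_choice_rate = counts["BAD CHOICE"] / total_entries

for name, value in [
    ("detection rate", detection_rate),
    ("false alarm rate", false_alarm_rate),
    ("plausible path rate", good_path_rate),
    ("bad choice rate", bad_choice_rate),
]:
    print(f"{name}: {100 * value:.1f}%")
```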

Of the valid images of humans, 56.6% showed the subject’s face, vs. 43.4% that
did not. Note that in most cases where the image does not show the face, the
subject entered the scene from below the camera and walked away from it, so
there was never an opportunity for a frontal view. Earlier experiments without
the frontal view heuristic captured FACE and NONFACE images with roughly equal
frequency, so it is clear that the heuristic helps.

At the end of the experiment, the camera directory occupied 34.5 megabytes, or
about seven megabytes per day of monitoring. Almost all of the storage consists
of image files, so presumably compression with an image-specific algorithm
would produce substantial savings. Use of an MPEG-like algorithm on the
reference images would be extremely effective, since the reference images are
all very nearly identical, and lossless compression would not be necessary.


6  Learning  Environment  Structure 

The  AVS  tracking  and  event  recognition  software 
uses  corresponding  rectangles  in  image  and  world 
coordinates  to  compute  an  approximate  image-to- 
world  mapping.  These  rectangles  are  created  by  a 
human  when  the  camera  system  is  set  up.  In  many 
situations  it  would  be  preferable  to  eliminate  even 
this  minimal  calibration  step,  in  order  to  reduce 
setup  cost  to  a  minimum. 

We have developed a system that learns the image-to-world mapping by watching
humans move around in the scene. Changes in the apparent size and position of
humans in the image provide information about the existence and depth of world
surfaces. The appearance and disappearance of humans provide information about
occlusion boundaries and locations where humans can enter or exit the scene.

6.1  Method 

The computation assumes weak perspective projection, i.e. that objects in the
scene are first projected orthographically to a plane passing through a
reference point on the object and parallel to the image plane, and then
projected to the image plane using true perspective. It is also assumed that
humans are usually in contact with a world surface that supports them, that the
camera is in an upright position (has roll angle zero), and that the internal
calibration parameters of the camera are known.

More precisely, assume front projection with the camera focal point at the
origin and looking down the Z axis of a left-handed coordinate system. Suppose
the camera observes a person in the world with head at world point
$H = (X_H, Y_H, Z_H)$ and feet at world point $F = (X_F, Y_F, Z_F)$. Let $F$ be
the reference point for weak perspective projection. Then the apparent height
$h$ of the person in the image is given by

\[ h = \frac{f \, |H - F| \cos\theta}{Z_F} \]

where $f$ is the camera focal length and $\theta$ is the camera tilt angle
relative to the local vertical direction. Solving for depth gives

\[ Z_F = \frac{f \, |H - F| \cos\theta}{h} \]

The person’s height $|H - F|$ has a known probability distribution, and the
tilt angle term $\cos\theta$ can be estimated from the appearance of the
person, or simply ignored for the shallow tilt angles typical of security
camera installations. Given enough observations, the equation can be used to
estimate the distance from the camera to points in the world where people
commonly walk.

Figure 10: Apparent height data collected in the experiment. Cell intensity is
the median of the image heights of observed humans when their feet were imaged
in the cell. Dark grey regions contain no data.
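
As a rough illustration of the depth computation, the sketch below assumes the
usual weak perspective relation, in which image height equals focal length
times person height times the cosine of the tilt angle, divided by depth. The
function name, the parameter names, and the default person height of 1.7 m are
assumptions chosen for the example; the text says only that the height has a
known distribution.

```python
import math


def depth_from_image_height(height_px: float,
                            focal_length_mm: float,
                            pixel_pitch_mm: float,
                            person_height_m: float = 1.7,
                            tilt_rad: float = 0.0) -> float:
    """Estimate distance (meters) to a person from apparent image height.

    Assumes weak perspective: h = f * |H - F| * cos(tilt) / Z.
    The 1.7 m default height is an illustrative assumption.
    """
    if height_px <= 0:
        raise ValueError("apparent height must be positive")
    # Convert the apparent height from pixels to millimeters on the sensor.
    height_on_sensor_mm = height_px * pixel_pitch_mm
    # Solve the weak perspective relation for depth Z (result in meters,
    # because the two millimeter quantities cancel).
    return (focal_length_mm * person_height_m * math.cos(tilt_rad)
            / height_on_sensor_mm)


# Example with made-up camera parameters: an 8 mm lens, 0.01 mm pixel pitch,
# and a person imaged 200 pixels tall.
print(depth_from_image_height(200, focal_length_mm=8.0, pixel_pitch_mm=0.01))
```

With tilt_rad left at zero the cosine term is ignored, which corresponds to the
shallow-tilt simplification mentioned in the text.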

The idea of recovering structure from observed sizes of humans is conceptually
related to shape-from-texture work in which the texture is made up of discrete
elements that are uniform in size and shape [Aloimonos and Swain, 1988,
Blostein and Ahuja, 1989]. In this case the texels (people) do not lie in the
imaged surface, and their size in the world is known. This makes depth recovery
substantially easier than it is in general shape-from-texture work.

6.2  Mapping  the  Environment 

The equation derived above has been used in a program that learns the structure
of its environment by watching humans move around in it. The program makes use
of the AVS core algorithms to detect and track people. Over time, it builds up
an image in which pixel value represents depth to the nearest world surface in
the corresponding direction.

The camera image is partitioned into a grid of 16x16-pixel squares, each of
which is associated with a histogram. Whenever the program detects a person in
the scene, it locates the histogram associated with the place where they are
standing, i.e., the one associated with the square containing the bottom center
of the motion region for the person. The apparent height of the person is
recorded in that histogram. Over time, the histogram for each location in the
image builds up a sample distribution for the apparent (image) height of humans
at that location. This can be used with the equation derived previously to
estimate the depth at that point.
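
The per-cell accumulation described above might be sketched as follows. This is
an illustration under stated assumptions, not the AVS code: the HeightMap class
and its method names are hypothetical, motion regions are assumed to be
axis-aligned boxes given as (left, top, right, bottom) in integer pixel
coordinates with y increasing downward, and raw samples are stored in place of
a binned histogram so that the median can be taken directly.

```python
from collections import defaultdict
from statistics import median

CELL = 16  # cell size in pixels, as stated in the text


class HeightMap:
    """Collects apparent heights of people per 16x16 image cell."""

    def __init__(self):
        # Maps (column, row) of a grid cell to the heights observed there.
        self.samples = defaultdict(list)

    def record(self, box):
        """Record one detection given its motion-region bounding box."""
        left, top, right, bottom = box
        # The person is taken to stand at the bottom center of the region.
        foot_x = (left + right) // 2
        foot_y = bottom
        cell = (foot_x // CELL, foot_y // CELL)
        self.samples[cell].append(bottom - top)  # apparent height in pixels
        return cell

    def median_height(self, cell):
        """Median apparent height for a cell, or None if it has no data."""
        values = self.samples.get(cell)
        return median(values) if values else None


hm = HeightMap()
hm.record((100, 50, 140, 250))
hm.record((104, 60, 136, 252))
print(hm.median_height((7, 15)))
```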

The program was allowed to operate for twenty-four hours during a typical
working day. Input was provided by the hallway camera used in section 5.
Figure 10 shows the raw output of the program. In the figure, pixel intensity
corresponds to the median observed height for the corresponding location. Dark
grey pixels are those for which no observations were recorded. The program was
instructed to discard observations in which the motion region for the person
touched the upper or lower image border, since the apparent height is invalid
in that condition. For this reason, there are no counts for the end of the
hallway.

The  height  data  of  Figure  10  were  converted  to 
depths  using  the  equation  derived  above.  Vertical 
pixel  pitch  was  taken  from  the  camera  technical 
manual,  and  the  nominal  lens  focal  length  was  used 
to  approximate  the  true  focal  length.  Histogram 
cells  for  which  fewer  than  ten  total  observations 
were  recorded  were  discarded. 
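
A minimal sketch of this conversion step is shown below. It assumes the weak
perspective relation with the tilt term ignored, uses the median height of each
cell, and drops cells with fewer than ten observations as the text describes.
The function name, the dictionary layout, the camera parameters, and the
nominal person height of 1.7 m are assumptions made for the example.

```python
from statistics import median

MIN_OBSERVATIONS = 10  # cells with fewer samples are discarded (from the text)


def depth_map(samples, focal_length_mm, pixel_pitch_mm, person_height_m=1.7):
    """Convert per-cell apparent heights (pixels) to depths (meters).

    `samples` maps a grid cell, e.g. (column, row), to a list of observed
    image heights. The 1.7 m person height is an illustrative assumption,
    and the camera tilt term is ignored (cos(tilt) taken as 1).
    """
    depths = {}
    for cell, heights in samples.items():
        if len(heights) < MIN_OBSERVATIONS:
            continue  # too little evidence for this cell
        # Median apparent height, converted from pixels to mm on the sensor.
        height_on_sensor_mm = median(heights) * pixel_pitch_mm
        depths[cell] = focal_length_mm * person_height_m / height_on_sensor_mm
    return depths


# Made-up data: one well-observed cell and one sparse cell.
example = {(7, 15): [200] * 10, (3, 4): [100] * 9}
print(depth_map(example, focal_length_mm=8.0, pixel_pitch_mm=0.01))
```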

Figure 11 shows the final depth map superimposed on the image. The range
estimates cover image regions corresponding to the floor, and vary smoothly
over most of the image. Anomalously large values occur in several locations at
right center below the small printer and workstation. These errors occur
because the office chair is frequently moved around in this region, and the
system sometimes mistakes it for a person. Since it is significantly smaller
than a real person, the system interprets it as evidence that the floor
supporting it is further away than it actually is. A similar problem produces
the anomalously high value of 8.9 meters at left center, at the base of the
doorway. It frequently happens that as a person exits the hall via the doorway,
their head goes out of sight while their body and feet are still visible. The
system records the height of the visible portion of the person in the cell at
the base of the doorway. Since this value is smaller than the true height of
the person, that cell appears to be further away than it really is.

Figure 11: Depth map recovered from the height data of figure 10. Depths are in
meters.

Figure 12: Ground truth range values for comparison with figure 11.


In order to assess the accuracy of the recovered depth map, we measured the
distance from the camera to seven points on the floor. The seven points and
their distances from the camera are shown superimposed on the image in
figure 12. Table 2 shows the estimated and actual ranges to the test points, as
well as the error in meters. The average absolute error for the seven test
points is 26 cm, which is less than 5% of the average distance.


Table 2: Estimated vs. actual range (meters) to ground truth points

  Point   Estimate (m)   Actual (m)   Error (m)
  A       4.70           4.80         -0.10
  B       5.00           5.40         -0.40
  C       5.90           5.89          0.01
  D       6.10           6.45         -0.35
  E       6.80           7.26         -0.46
  F       7.70           8.18         -0.48
  G       9.80           9.85         -0.05
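
The summary statistics quoted for Table 2 can be checked directly from the
tabulated values. The short sketch below recomputes the per-point errors, the
average absolute error, and its size relative to the average actual distance;
the variable names are chosen for the example.

```python
# (estimate, actual) range in meters for each ground truth point in Table 2.
points = {
    "A": (4.70, 4.80),
    "B": (5.00, 5.40),
    "C": (5.90, 5.89),
    "D": (6.10, 6.45),
    "E": (6.80, 7.26),
    "F": (7.70, 8.18),
    "G": (9.80, 9.85),
}

# Signed error is estimate minus actual, matching the table's error column.
errors = {name: est - act for name, (est, act) in points.items()}

mean_abs_error = sum(abs(e) for e in errors.values()) / len(errors)
mean_actual = sum(act for _, act in points.values()) / len(points)
relative_error = mean_abs_error / mean_actual

print(f"average absolute error: {100 * mean_abs_error:.0f} cm")
print(f"relative to average distance: {100 * relative_error:.1f}%")
```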

7  Conclusion 


The goal of our research is to develop algorithms and systems that can be used
to describe a video sequence in terms of moving objects and events. These
algorithms will enable a generation of smart cameras that deliver information
about scenes rather than raw images. We have created a set of core algorithms
comprising the Autonomous Video Surveillance (AVS) system, including routines
for moving object detection, tracking, and abstract event recognition. The AVS
system has been used to create several surveillance applications, including a
video surveillance shell, a program that creates concise logs of activity in
the field of view, and a program that learns scene structure by watching humans
moving around in the environment.

Our  future  work  on  AVS  will  address  weaknesses 
in  the  current  system,  and  will  add  new  capabilities 
that  support  more  complex  applications.  Work  is 
planned  in  three  main  areas: 

Robust Change Detection and Tracking: Experiments have shown that errors in the
moving object detection computation are the most common cause of errors in our
applications. This is particularly a problem in outdoor environments. We plan
to develop new change detection algorithms based on dynamic background models
that capture the way the background changes over time. We will also exploit
contextual information to predict the expected size and appearance of moving
objects in the scene.

Improved  Event  Recognition:  We  will  extend  our 
motion-graph-based  event  recognition  algorithms 
to  a  broader  range  of  events,  and  will  develop 
methods  of  specifying  and  recognizing  compound 
events  and  event  sequences. 

Applications: We will extend the existing video surveillance shell to make use
of authentication sensors, and to distinguish between authorized and
unauthorized individuals. We will continue to use AVS technology to develop
applications that address military and other government video surveillance
needs.